Buying a property involves a huge financial commitment. To invest in the right property at the right time, it is important to understand the factors that drive its price and, even more important, what the forecast price of that property will be in the future. One of the biggest players in the real estate industry is Zillow, which deals in buying, selling, and renting homes.
We have median weekly sales data for houses sold in the San Francisco region from the 1st week of January 2018 to the 1st week of January 2023.
Code
sales_train %>%
  ggplot() +
  geom_line(aes(ds, y)) +
  xlab("Year Week") +
  ylab("Median Sales Price") +
  ggtitle("Line Plot of Median Sales Price") +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color = 'purple')
  )
We can see multiplicative seasonality in the plot:
Seasonality (uptrend during the summer months and downtrend during the holiday period)
The mean value is increasing over the years
Summary Stats
Mean/Median
Code
summary(sales_train$y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
846500 922500 973972 967392 998724 1099750
Standard Deviation
Code
sd(sales_train$y)
[1] 58959
The mean sales price is $967,392 and the median is $973,972, with a standard deviation of $58,959.
Distribution of Timeseries
Code
# Shared theme for all four panels
thm <- theme(
  plot.title = element_text(size = 20, face = "bold"),
  plot.subtitle = element_text(size = 16),
  legend.title = element_text(size = 14),
  legend.text = element_text(size = 12),
  panel.background = element_rect(fill = 'lightblue', color = 'purple')
)

hist <- sales_train %>%
  ggplot() +
  geom_histogram(aes(y), colour = 4, fill = "white", bins = 20) +
  thm

dens <- sales_train %>%
  ggplot() +
  geom_density(aes(y), colour = 4, fill = "white") +
  thm

violin <- sales_train %>%
  ggplot() +
  geom_violin(aes("", y), colour = 4, fill = "white") +
  thm

boxplot <- sales_train %>%
  ggplot() +
  geom_boxplot(aes("", y), colour = 4, fill = "white") +
  thm

ggarrange(hist, dens, violin, boxplot,
          labels = c("A", "B", "C", "D"),
          ncol = 2, nrow = 2)
We don’t see any outliers in the data.
Mean and median are very close.
The data is evenly spread and approximately normally distributed.
Moving Average of Timeseries
Code
sales_ma_data <- sales_train %>%
  arrange(ds) %>%
  mutate(
    ma_13_left   = rollapply(y, 13, FUN = mean, align = "left",   fill = NA),
    ma_13_right  = rollapply(y, 13, FUN = mean, align = "right",  fill = NA),
    ma_13_center = rollapply(y, 13, FUN = mean, align = "center", fill = NA)
  ) %>%
  mutate(
    value_ma_3  = rollapply(y, 3,  FUN = mean, align = "center", fill = NA),
    value_ma_5  = rollapply(y, 5,  FUN = mean, align = "center", fill = NA),
    value_ma_7  = rollapply(y, 7,  FUN = mean, align = "center", fill = NA),
    value_ma_13 = rollapply(y, 13, FUN = mean, align = "center", fill = NA),
    value_ma_25 = rollapply(y, 25, FUN = mean, align = "center", fill = NA),
    value_ma_49 = rollapply(y, 49, FUN = mean, align = "center", fill = NA)
  )

sales_tbl_pivot <- sales_ma_data %>%
  pivot_longer(
    cols = ma_13_left:value_ma_49,
    values_to = "value_ma",
    names_to = "ma_order"
  ) %>%
  mutate(
    ma_order = factor(
      ma_order,
      levels = c("ma_13_center", "ma_13_left", "ma_13_right",
                 "value_ma_3", "value_ma_5", "value_ma_7",
                 "value_ma_13", "value_ma_25", "value_ma_49"),
      labels = c("ma_13_center", "ma_13_left", "ma_13_right",
                 "value_ma_3", "value_ma_5", "value_ma_7",
                 "value_ma_13", "value_ma_25", "value_ma_49")
    )
  )

x <- sales_tbl_pivot %>%
  filter(!ma_order %in% c("ma_13_center", "ma_13_left", "ma_13_right",
                          "value_ma_7", "value_ma_49")) %>%
  mutate(
    ma_order = case_when(
      ma_order == 'value_ma_3'  ~ '3rd Order',
      ma_order == 'value_ma_5'  ~ '5th Order',
      ma_order == 'value_ma_13' ~ '13th Order',
      ma_order == 'value_ma_25' ~ '25th Order'
    )
  ) %>%
  mutate(
    ma_order = factor(
      ma_order,
      levels = c('3rd Order', '5th Order', '13th Order', '25th Order'),
      labels = c('3rd Order', '5th Order', '13th Order', '25th Order')
    )
  ) %>%
  ggplot() +
  geom_line(aes(ds, y), size = 1) +
  geom_line(aes(ds, value_ma, color = ma_order), size = 1) +
  scale_color_discrete(name = 'MA Order') +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color = 'purple')
  ) +
  ylab('Median Sales Price') +
  xlab('Year Week') +
  ggtitle('Moving Average Graph')

x
We examine moving averages of order 3, 5, 13, and 25. From the graph, the 13th-order moving average best captures the seasonality; the other orders either oversmooth or undersmooth the series.
Since we have observed seasonality, we proceed with STL decomposition. The decomposition explains the seasonality, but we still see a cyclic pattern in the remainder as it moves up and down, so we cannot certify it as white noise.
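The STL decomposition referenced above can be produced with the feasts package; a minimal sketch, assuming sales_train is already a weekly tsibble indexed by ds:

```r
library(fable)
library(feasts)

# STL decomposition of the weekly median sales price;
# season(window = "periodic") assumes a stable yearly pattern
sales_train %>%
  model(STL(y ~ season(window = "periodic"))) %>%
  components() %>%
  autoplot()
```

The components() output also makes it easy to inspect the remainder series directly when judging whether it behaves like white noise.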
Section 2
ARIMA Modeling
Rolling SD of Differenced Series
From the above plot we can clearly see that both the mean and the standard deviation vary over time. We will check the variance again after taking the rolling difference.
Code
sales_diff <- sales_train %>%
  mutate(value_diff = y - lag(y)) %>%
  as_tsibble(index = ds)

sales_diff %>%
  mutate(diff_sd = zoo::rollapply(y, FUN = sd, width = 12, fill = NA)) %>%
  ggplot() +
  geom_line(aes(ds, diff_sd)) +
  geom_smooth(aes(ds, diff_sd), method = 'lm', se = F) +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color = 'purple')
  ) +
  ggtitle("Standard Deviation of Differenced Sales Price, over Time") +
  ylab("SD of Differenced Sales Price") +
  xlab("Year Week")
`geom_smooth()` using formula = 'y ~ x'
The rolling SD of the differenced series shows negligible change in variance over time, so we do not need a log or Box-Cox transformation.
Seasonal Differenced
As this is property sales price data and we have observed seasonality, the next step is to check the seasonal difference. We use a seasonal difference of 1 year and take 1 additional lag after that, as seasonality was still present after one difference.
Based on the lag plots and stationarity checks, we can assume ARIMA(2,1,0)(1,1,0).
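A sketch of how the seasonal plus first difference can be inspected with feasts (assuming roughly 52 weekly observations per seasonal cycle):

```r
library(feasts)

# Apply a seasonal difference (lag 52 weeks, i.e. 1 year), then a first
# difference, and inspect the series, ACF, and PACF together
sales_train %>%
  gg_tsdisplay(difference(y, 52) %>% difference(),
               plot_type = "partial")
```

The ACF/PACF panels from this display are what suggest the non-seasonal AR order and the seasonal terms assumed above.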
BIC Model Comparison
We build and compare several models by BIC. Based on the result set below, Model 7 with ARIMA(2,1,2) is the best model, which is close to our manual estimate of ARIMA(2,1,0)(1,1,0).
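The BIC comparison can be run with fable's glance(); a sketch under the assumption that the candidates are variations around the manually identified order:

```r
library(fable)

# Fit several candidate seasonal ARIMA models on the training data
bic_models <- sales_train %>%
  model(
    mod1 = ARIMA(y ~ pdq(2, 1, 0) + PDQ(1, 1, 0)),
    mod2 = ARIMA(y ~ pdq(1, 1, 1) + PDQ(1, 1, 0)),
    mod3 = ARIMA(y ~ pdq(2, 1, 2) + PDQ(1, 1, 0))
  )

# Rank the candidates by BIC (lower is better)
bic_models %>%
  glance() %>%
  arrange(BIC) %>%
  select(.model, AIC, AICc, BIC)
```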
Section 3
Prophet Modeling
Code
suppressMessages({
  model = prophet(sales_train)
  future = make_future_dataframe(model, periods = 27, freq = 'weeks')
  forecast = predict(model, future)

  plot(model, forecast) +
    ggtitle("Default Facebook Prophet Model") +
    ylab("SF Housing Sales Prices") +
    xlab("Date") +
    theme(
      plot.title = element_text(size = 20, face = "bold"),
      plot.subtitle = element_text(size = 16),
      legend.title = element_text(size = 14),
      legend.text = element_text(size = 12),
      panel.background = element_rect(fill = 'lightblue', color = 'purple')
    )
})
Interactive Visualization
Code
dyplot.prophet(model,forecast)
Prophet Decomposition
Code
prophet_plot_components(model,forecast)
The sales price has exhibited a steady upward trend since mid-2021, which is mainly attributed to the post-Covid-19 ease in the housing market. Additionally, there appears to be a seasonal pattern in the sales price, with prices rising during the summer months, specifically in June, and declining during the fall and end of the year.
Hyperparameters
By adjusting the hyperparameters and trying multiple combinations we obtained a realistic fitted line. We adjusted the changepoint.range = 0.90, n.changepoints = 15, and changepoint.prior.scale = 0.1 parameters.
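A sketch of the tuned Prophet fit with those arguments (the R prophet interface takes them as dot-separated arguments):

```r
library(prophet)

# Refit Prophet with the tuned changepoint settings
tuned_model <- prophet(
  sales_train,
  changepoint.range = 0.90,       # place changepoints in the first 90% of history
  n.changepoints = 15,            # number of potential changepoints
  changepoint.prior.scale = 0.1   # trend flexibility
)

tuned_future <- make_future_dataframe(tuned_model, periods = 27, freq = 'weeks')
tuned_forecast <- predict(tuned_model, tuned_future)
plot(tuned_model, tuned_forecast)
```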
Defining the minimum and maximum points (floor and cap) helps users check for wild values in the plot; this is very useful in stock price forecasting, where the sell/buy decision depends on the price value. In our case it is not very relevant.
Code
suppressMessages({
  sixmnths_future_df = make_future_dataframe(model, periods = 27)
  sixmnths_forecast_df = predict(model, sixmnths_future_df)

  # Set "floor" and "cap" in training data
  sales_train$floor = 850000
  sales_train$cap = 1500000
  future$floor = 850000
  future$cap = 1200000

  # Set floor and cap in forecast data
  sixmnths_future_df$floor = 850000
  sixmnths_future_df$cap = 1200000

  sat_model = prophet(sales_train, growth = 'logistic')
  sat_six_mnths_forecast = predict(sat_model, sixmnths_future_df)

  plot(sat_model, sat_six_mnths_forecast) +
    ylim(850000, 1200000) +
    theme_bw() +
    xlab("Year Week") +
    ylab("Median Sales Price")
})
Seasonality Identification
Code
prophet_plot_components(model,forecast)
From the plot we can infer multiplicative seasonality: the amplitude of the seasonal fluctuations changes over time, which is characteristic of a multiplicative seasonal pattern.
Holiday Impact
As this is San Francisco sales data, let us check for holiday effects by adding the US holiday calendar to the model.
We don’t see much holiday impact in the plot; a couple of spikes are present, but they do not recur in every year. We should not include the holiday component in our time series model.
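The US holiday calendar can be attached before fitting; a minimal sketch using prophet's built-in country holidays:

```r
library(prophet)

# Create an unfitted model, add US holidays, then fit on the training data
holiday_model <- prophet(fit = FALSE)
holiday_model <- add_country_holidays(holiday_model, country_name = 'US')
holiday_model <- fit.prophet(holiday_model, sales_train)

holiday_future <- make_future_dataframe(holiday_model, periods = 27, freq = 'weeks')
holiday_forecast <- predict(holiday_model, holiday_future)

# The components plot now includes a "holidays" panel
prophet_plot_components(holiday_model, holiday_forecast)
```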
Section 4
Model Comparison and Validation
In model selection and validation we compare the different models we have created, evaluating them on common metrics such as RMSE and MAPE. After the comparison we select the best model and run it on the test data.
Cross-Validation
In-Sample Cross-Validation to Compare SNAIVE vs ARIMA vs Prophet Models
Based on the plot, the Prophet model tracks the actual line most closely. We compare accuracy across the other metrics as well.
Code
cv_data <- sales_train %>%
  stretch_tsibble(.init = 54 * 3, .step = 27 * 1)

cv_forecast <- cv_data %>%
  model(
    naive_w_season = SNAIVE(y),
    arima_mod = ARIMA(y ~ pdq(2, 1, 2) + PDQ(1, 1, 0)),
    prophet_mod = fable.prophet::prophet(
      y ~ growth("linear",
                 changepoint_range = 0.90,
                 n_changepoints = 15,
                 changepoint_prior_scale = 0.1) +
        season("year", type = "multiplicative")
    )
  ) %>%
  forecast(h = 26)

z <- cv_forecast %>%
  as_tsibble() %>%
  dplyr::select(-y) %>%
  left_join(sales_train, by = "ds") %>%
  ggplot() +
  geom_line(aes(ds, y)) +
  geom_line(aes(ds, .mean, color = factor(.id), linetype = .model)) +
  scale_color_discrete(name = 'Iteration') +
  ggtitle('SNAIVE vs ARIMA vs Prophet Forecast for Each CV Iteration') +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color = 'purple')
  ) +
  ylab("Median Sales Price") +
  xlab("Year Week")

z
           .model .type         ME     RMSE      MAE        MPE     MAPE      MASE     RMSSE      ACF1
1:      arima_mod  Test   6112.045 42480.69 34184.34  0.5067585 3.349510 0.6345882 0.6411619 0.9533924
2: naive_w_season  Test -19964.464 64302.86 55980.49 -2.3080673 5.665352 1.0392057 0.9705244 0.9593905
3:    prophet_mod  Test  23190.711 49520.15 36109.11  2.1636915 3.509316 0.6703191 0.7474086 0.9352478
The arima_mod model has an RMSE of 42,480.69 on the test data, which suggests that its predictions are off by roughly that much, on average, compared to the actual values in the test set. Similarly, the prophet_mod model has a MAPE of 3.509316 on the test data, meaning its predictions have an average absolute percentage error of about 3.5%.
Below we check the RMSE of all models at horizons T+1, T+2, and so on.
RMSE
Code
cv_forecast %>%
  group_by(.id, .model) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "y", distribution = y) %>%
  accuracy(sales_train, by = c("h", ".model")) %>%
  ggplot(aes(x = h, y = RMSE, color = .model)) +
  geom_point() +
  geom_line() +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color = 'purple')
  ) +
  ggtitle('RMSE of Each Model at Different Intervals') +
  ylab('Average RMSE at Forecasting Intervals') +
  xlab('Year Week in the Future')
Based on the RMSE plotted above, we select the ARIMA model as the best model for our test data. For the Prophet model there is a gap in the data, which is why its RMSE drops so sharply.
We will go ahead with ARIMA pdq(2,1,2)+PDQ(1,1,0) as the best model.
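The final step of fitting the selected model on the full training set and scoring it on the held-out data can be sketched as follows (sales_test is assumed to be the test-period tsibble kept aside earlier):

```r
library(fable)

# Fit the chosen seasonal ARIMA on the training data
best_fit <- sales_train %>%
  model(best_arima = ARIMA(y ~ pdq(2, 1, 2) + PDQ(1, 1, 0)))

# Forecast over the test horizon and evaluate against the held-out data
best_fit %>%
  forecast(h = nrow(sales_test)) %>%
  accuracy(sales_test)
```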